27 research outputs found

    MT-based sentence alignment for OCR-generated parallel texts

    The performance of current sentence alignment tools varies according to the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments, which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
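
    The anchor-finding step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `overlap_score` is a simplified stand-in for BLEU (mean clipped 1- and 2-gram precision, no brevity penalty), and the function names, the greedy monotonic search, and the 0.5 threshold are all assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list (empty if the list is too short)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(hyp, ref, max_n=2):
    """Simplified BLEU-like score (assumption, not real BLEU):
    mean clipped n-gram precision for n = 1..max_n, no brevity penalty."""
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp_tokens, n), ngrams(ref_tokens, n)
        total = sum(hyp_counts.values())
        if total == 0:
            precisions.append(0.0)
            continue
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def find_anchors(translated_src, tgt, threshold=0.5):
    """Greedily pick monotonic (src_index, tgt_index) pairs whose score
    exceeds the (hypothetical) threshold; unmatched sentences form gaps."""
    anchors = []
    last_j = -1
    for i, hyp in enumerate(translated_src):
        best_j, best_score = None, threshold
        for j in range(last_j + 1, len(tgt)):
            score = overlap_score(hyp, tgt[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            anchors.append((i, best_j))
            last_j = best_j
    return anchors
```

    Here `translated_src` is assumed to hold machine translations of the source sentences into the target language. Sentences that fall between two anchors would then be handled by the gap-filling heuristics the abstract mentions, which this sketch does not cover.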

    Real Anaphora Resolution is Hard

    We introduce a system for anaphora resolution for German that uses various resources in order to develop a real system as opposed to systems based on idealized assumptions, e.g. the use of true mentions only or perfect parse trees and perfect morphology. The components that we use to replace such idealizations comprise a full-fledged morphology, a Wikipedia-based named entity recognition, a rule-based dependency parser and a German wordnet. We show that under these conditions coreference resolution is (at least for German) still far from being perfect.

    Machine translation of TV subtitles for large scale production

    This paper describes our work on building and employing Statistical Machine Translation systems for TV subtitles in Scandinavia. We have built translation systems for Danish, English, Norwegian and Swedish. They are used in daily subtitle production and translate large volumes. As an example we report on our evaluation results for three TV genres. We discuss the lessons learned in the system development process, which shed interesting light on the practical use of Machine Translation technology.

    Anaphora Resolution with Real Preprocessing

    In this paper we focus on anaphora resolution for German, a highly inflected language which also allows for closed form compounds (i.e. compounds without spaces). In particular, we describe a system that only uses real preprocessing components, e.g. a dependency parser, a two-level morphological analyser etc. We trace the performance drop occurring under these conditions back to underspecification and ambiguity at the morphological level. A demanding subtask of anaphora resolution is the resolution of so-called bridging anaphora, a special variant of nominal anaphora where the heads of the coreferent noun phrases do not match. We experiment with two different resources in order to find out how best to cope with this problem.

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU. Comment: ACL 2019 (camera-ready).
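
    To make "stochastic gates and a differentiable relaxation of the L0 penalty" more concrete, the sketch below uses the Hard Concrete distribution of Louizos et al. The abstract does not name the relaxation, so treating it as Hard Concrete is an assumption here, as are the stretch parameters (-0.1, 1.1), the temperature 2/3, and all function names. Each head would get one learnable `log_alpha`; only the forward sampling and the penalty are shown, not the training loop.

```python
import math
import random

# Stretch interval and temperature: commonly used Hard Concrete defaults,
# assumed here rather than taken from the abstract.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Sample one gate value in [0, 1]; exact zeros and ones are possible."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA
    return min(1.0, max(0.0, stretched))


def prob_open(log_alpha):
    """Probability that the gate is non-zero (closed form for this relaxation)."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas):
    """Differentiable surrogate for the number of heads that stay active."""
    return sum(prob_open(a) for a in log_alphas)


def gate_heads(head_outputs, log_alphas, rng):
    """Scale each head's output vector by its sampled gate."""
    gates = [sample_gate(a, rng) for a in log_alphas]
    return [[g * x for x in head] for g, head in zip(gates, head_outputs)]
```

    In training, `expected_l0` would be added to the translation loss with a weighting coefficient, so that heads whose `log_alpha` drifts strongly negative are gated to exactly zero and can be removed.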

    The UZH system combination system for WMT 2011

    This paper describes the UZH system that was used for the WMT 2011 system combination shared task submission. We participated in the system combination task for the translation directions DE-EN and EN-DE. The system uses Moses as a backbone, with the outputs of the 2-3 best individual systems being integrated through additional phrase tables. The system compares well to other system combination submissions, with no other submission being significantly better. A BLEU-based comparison to the individual systems, however, indicates that it achieves no significant gains over the best individual system.

    Combining multi-engine machine translation and online learning through dynamic phrase tables

    Extending phrase-based Statistical Machine Translation systems with a second, dynamic phrase table has been done for multiple purposes. Promising results have been reported for hybrid or multi-engine machine translation, i.e. building a phrase table from the knowledge of external MT systems, and for online learning. We argue that, in prior research, dynamic phrase tables are not scored optimally because they may be of small size, which makes the Maximum Likelihood Estimation of translation probabilities unreliable. We propose basing the scores on frequencies from both the dynamic corpus and the primary corpus instead, and show that this modification significantly increases performance. We also explore the combination of multi-engine MT and online learning.
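
    The scoring argument can be illustrated with a small sketch. The abstract does not give the exact formula, so the pooled estimate below (summing phrase-pair and source-phrase counts from both corpora before dividing) is an assumption, and the function names and toy counts are hypothetical.

```python
from collections import Counter


def mle_score(pair_counts, src_counts, src, tgt):
    """Plain maximum likelihood estimate p(tgt | src) from a single corpus."""
    denom = src_counts[src]
    return pair_counts[(src, tgt)] / denom if denom else 0.0


def pooled_score(dyn_pairs, dyn_src, prim_pairs, prim_src, src, tgt):
    """Assumed variant: pool frequencies from the dynamic and primary corpus
    so that a tiny dynamic corpus does not produce overconfident scores."""
    denom = dyn_src[src] + prim_src[src]
    numer = dyn_pairs[(src, tgt)] + prim_pairs[(src, tgt)]
    return numer / denom if denom else 0.0


# Toy counts (hypothetical): the dynamic corpus has seen "haus" only once.
dyn_pairs = Counter({("haus", "house"): 1})
dyn_src = Counter({"haus": 1})
prim_pairs = Counter({("haus", "house"): 80, ("haus", "home"): 20})
prim_src = Counter({"haus": 100})

dynamic_only = mle_score(dyn_pairs, dyn_src, "haus", "house")
pooled = pooled_score(dyn_pairs, dyn_src, prim_pairs, prim_src, "haus", "house")
```

    With these toy counts, the dynamic-only estimate is 1.0 from a single observation, whereas the pooled estimate is 81/101, which reflects the evidence in the primary corpus as well.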

    Iterative, MT-based sentence alignment of parallel texts

    Recent research has shown that MT-based sentence alignment is a robust approach for noisy parallel texts. However, using Machine Translation for sentence alignment causes a chicken-and-egg problem: to train a corpus-based MT system, we need sentence-aligned data, and MT-based sentence alignment depends on an MT system. We describe a bootstrapping approach to sentence alignment that resolves this circular dependency by computing an initial alignment with length-based methods. Our evaluation shows that iterative MT-based sentence alignment significantly outperforms widespread alignment approaches on our evaluation set, without requiring any linguistic resources other than the bitext to be aligned.
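
    The bootstrapping loop can be outlined as a short control-flow sketch. It only shows how the circular dependency is broken; the three callables (`length_align`, `train_mt`, `mt_align`), their signatures, and the fixed number of rounds are hypothetical placeholders rather than components named in the abstract.

```python
def bootstrap_alignment(src, tgt, length_align, train_mt, mt_align, rounds=2):
    """Iteratively refine a sentence alignment.

    length_align(src, tgt)          -> list of (src_sentence, tgt_sentence) pairs
    train_mt(alignment)             -> translate function trained on those pairs
    mt_align(src, translated, tgt)  -> new list of aligned pairs
    All three are assumed interfaces supplied by the caller.
    """
    # Step 1: a length-based alignment needs no MT system.
    alignment = length_align(src, tgt)
    for _ in range(rounds):
        # Step 2: train an MT system on the current alignment.
        translate = train_mt(alignment)
        # Step 3: re-align using translations as the point of comparison.
        translated = [translate(sentence) for sentence in src]
        alignment = mt_align(src, translated, tgt)
    return alignment
```

    A real setup would likely stop when the alignment no longer changes between rounds instead of using a fixed `rounds` value; that choice is left open here.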

    Disambiguation of English contractions for machine translation of TV subtitles

    This paper presents a disambiguation method for English apostrophe+s contractions. They occur frequently in subtitles and pose special difficulties for Machine Translation. We propose to disambiguate these contractions in a preprocessing step and show that this leads to improved translation quality.

    From historic books to annotated XML: Building a large multilingual diachronic corpus

    This paper introduces our approach towards annotating a large heritage corpus, which spans over 100 years of alpine literature. The corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club, 60% of which represent German texts, 38% French, 1% Italian and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in processing a multilingual corpus by referring to the most challenging annotation phases such as article identification, correction of optical character recognition (OCR) errors, tokenization, and language identification. The paper aims to raise awareness of the efforts in building and annotating multilingual corpora rather than to evaluate each individual annotation phase. Keywords: multilingual corpora, cultural heritage, corpus annotation, text digitization